Script Identification from Trilingual Documents using Profile Based Features

Authors

  • M. C. Padma
  • P. A. Vijaya
Abstract

In a multi-script environment, the majority of documents may contain text printed in more than one script or language. For automatic processing of such documents through Optical Character Recognition (OCR), the different script regions of the document must first be identified. This paper proposes a model to identify the script type of a trilingual document printed in Kannada, Hindi and English scripts. The distinguishing features of the Kannada, Hindi and English scripts are derived from the nature of the top and bottom profiles, and the proposed model is trained to learn these features for each script. The experiments used 1500 text lines for training and 1500 text lines for testing. A k-nearest neighbor classifier is used to classify each test sample. The results are encouraging and support the efficacy of the proposed model: the average success rate is 99.5% on a data set constructed from scanned document images.
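The abstract names two ingredients, top/bottom profiles and a k-nearest neighbor classifier, without listing the exact features. The following is only a minimal sketch of how such a pipeline might look, assuming a binarized text line stored as a list of rows with 1 for ink and 0 for background. The `flatness` feature and every function name here are hypothetical illustrations, not the features or code used in the paper.

```python
from collections import Counter
import math


def top_profile(img):
    """For each column, the row index of the first ink pixel from the top (None if blank)."""
    height, width = len(img), len(img[0])
    profile = []
    for col in range(width):
        ink_rows = [row for row in range(height) if img[row][col]]
        profile.append(ink_rows[0] if ink_rows else None)
    return profile


def bottom_profile(img):
    """For each column, the row index of the last ink pixel from the top (None if blank)."""
    height, width = len(img), len(img[0])
    profile = []
    for col in range(width):
        ink_rows = [row for row in range(height) if img[row][col]]
        profile.append(ink_rows[-1] if ink_rows else None)
    return profile


def flatness(profile):
    """Hypothetical feature: fraction of non-blank columns lying on the most common row."""
    values = [v for v in profile if v is not None]
    if not values:
        return 0.0
    return Counter(values).most_common(1)[0][1] / len(values)


def profile_features(img):
    """Two-element feature vector built from the top and bottom profiles (an assumption)."""
    return (flatness(top_profile(img)), flatness(bottom_profile(img)))


def knn_classify(train, query, k=3):
    """Majority vote among the k training samples closest to query (Euclidean distance).

    train is a list of (feature_vector, label) pairs.
    """
    nearest = sorted(train, key=lambda sample: math.dist(sample[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]
```

In use, `profile_features` would be computed for each labelled training line, and `knn_classify(train, profile_features(test_line))` would return the predicted script label. A flat top profile is one plausible cue for the Devanagari headline, but the choice of features and of k would have to come from the full paper.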


Similar resources

Gabor Features Based Script Identification of Lines within a Bilingual/Trilingual Document

OCR technology for Indian documents is at an emerging stage, and most Indian OCR systems can read documents written in only a single script. As many commercial and official documents of different states of India are trilingual in nature, identification of script and/or language is one of the elementary tasks for multi-script document recognition. A script recognizer sim...


Handwritten Script Identification: Fusion based Approaches

Script identification is one of the preprocessing steps in any document image processing task. Script identification in printed documents has received greater attention, whereas script identification in handwritten documents has received less attention from the document research community. Almost all the existing works have made attempts on identifying suitable features or classifiers for handwrit...


Handwritten Script Recognition Using DCT, Gabor Filter and Wavelet Features at Line Level

In a country like India, where many scripts are in use, automatic identification of printed and handwritten script facilitates many important applications, including sorting of document images and searching online archives of document images. In this paper, a multiple-feature-based approach is presented to identify the script type of a collection of handwritten documents. Eight popula...


Identification of Telugu, Devanagari and English Scripts Using Discriminating Features

In a multi-script, multilingual environment, a document may contain text lines in more than one script or language. It is necessary to identify the different script regions of the document in order to feed the document to the OCR of each individual language. In this context, this paper proposes a model to identify and separate text lines of Telugu, Devanagari and English scripts from a ...


Global Approach for Script Identification using Wavelet Packet Based Features

In a multi-script environment, archives of documents with text regions printed in different scripts are common in practice. For automatic processing of such documents through Optical Character Recognition (OCR), it is necessary to identify the different script regions of the document. In this paper, a novel texture-based approach is presented to identify the script type of the collection of documen...



Journal title:
  • IJCSA

Volume 7, Issue

Pages -

Publication date: 2010